[Internal] PPAF: Adds Dynamic Enablement of PPAF #5310

Merged
FabianMeiswinkel merged 29 commits into master from users/nalutripician/ppafDynamicEnable on Aug 13, 2025

@NaluTripician (Contributor) commented Jul 22, 2025

Pull Request Template

Description

This pull request enhances the per-partition automatic failover (PPAF) functionality in the Azure Cosmos SDK. It adds a default cross-region hedging strategy, dynamic enablement of PPAF based on the database account configuration without restarting the SDK client, and new tests that validate these behaviors. The most important changes are grouped by theme below:

Enhancements to Availability Strategy:

  • Added a new method SDKDefaultCrossRegionHedgingStrategy in AvailabilityStrategy.cs to provide a default hedging strategy for cross-region failover, including support for write requests on multi-region accounts.
  • Introduced an internal flag IsSDKDefaultStrategy in CrossRegionHedgingAvailabilityStrategy to differentiate SDK default strategies from custom ones. Updated the constructor to accept this flag.

Dynamic PPAF Enablement:

  • Updated GlobalEndpointManager.cs to dynamically enable or disable PPAF based on the enablePartitionLevelFailover flag retrieved from the database account properties. Added logic to reset the availability strategy to null if PPAF is disabled and no custom strategy is set.
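
The toggle described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SDK's implementation: HedgingStrategy, ClientPolicy, and PpafToggle are hypothetical stand-ins for the real types, the threshold values are placeholders, and installing the default strategy on enablement is an assumption based on the SDKDefaultCrossRegionHedgingStrategy bullet.

```csharp
using System;

// Hypothetical stand-in for CrossRegionHedgingAvailabilityStrategy.
public sealed class HedgingStrategy
{
    public HedgingStrategy(TimeSpan threshold, TimeSpan thresholdStep, bool isSdkDefault)
    {
        this.Threshold = threshold;
        this.ThresholdStep = thresholdStep;
        this.IsSdkDefault = isSdkDefault;
    }

    public TimeSpan Threshold { get; }
    public TimeSpan ThresholdStep { get; }

    // Mirrors the role of the IsSDKDefaultStrategy flag: true when the SDK,
    // not the application, installed this strategy.
    public bool IsSdkDefault { get; }
}

// Hypothetical stand-in for the client's connection policy.
public sealed class ClientPolicy
{
    public bool EnablePartitionLevelFailover { get; set; }
    public HedgingStrategy AvailabilityStrategy { get; set; }
}

public static class PpafToggle
{
    // Placeholder values for illustration only; not the SDK's actual defaults.
    private static readonly TimeSpan AssumedThreshold = TimeSpan.FromMilliseconds(1000);
    private static readonly TimeSpan AssumedThresholdStep = TimeSpan.FromMilliseconds(500);

    // Called each time the database account properties are refreshed.
    public static void Apply(ClientPolicy policy, bool accountEnablesPpaf)
    {
        policy.EnablePartitionLevelFailover = accountEnablesPpaf;

        if (accountEnablesPpaf)
        {
            // Assumption: install the SDK default only when nothing is configured.
            if (policy.AvailabilityStrategy == null)
            {
                policy.AvailabilityStrategy = new HedgingStrategy(
                    AssumedThreshold, AssumedThresholdStep, isSdkDefault: true);
            }
        }
        else if (policy.AvailabilityStrategy != null && policy.AvailabilityStrategy.IsSdkDefault)
        {
            // PPAF is off and the strategy is not a custom one: reset it to null.
            policy.AvailabilityStrategy = null;
        }
    }
}
```

In this sketch a strategy supplied by the application (isSdkDefault: false) survives both transitions, which is the reason the description gives for adding a flag that distinguishes SDK defaults from custom strategies.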

Default Hedging Thresholds:

  • Changed the visibility of DefaultHedgingThresholdInMilliseconds and DefaultHedgingThresholdStepInMilliseconds in DocumentClient.cs from private to internal for broader accessibility within the SDK.
  • Updated the initialization logic in InitializePartitionLevelFailoverWithDefaultHedging to use the new SDKDefaultCrossRegionHedgingStrategy.

Tests for PPAF Functionality:

  • Added a new integration test ReadItemAsync_WithPPAFDynamicOverride in CosmosItemIntegrationTests.cs to validate dynamic PPAF enablement, hedging behavior, and fallback when PPAF is disabled. This includes fault injection and diagnostics validation.

End to End Validation:

CreateItemAsync: Strong Consistency Account with Direct Mode:

[Screenshot 2025-08-07 162428]

CreateItemAsync: Strong Consistency Account with Gateway Mode:

[Screenshot]

CreateItemAsync: Session Consistency Account with Direct Mode:

[Screenshot]

CreateItemAsync: Session Consistency Account with Gateway Mode:

[Screenshot]

[Note: The orange graph shows the number of requests processed in the North Central US region, whereas the blue graph shows the number of requests processed in the Central US region.]

Type of change

  • New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #5304

Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs Outdated
@kundadebdatta kundadebdatta marked this pull request as draft July 30, 2025 01:53
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalPartitionEndpointManagerCore.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs
Comment thread Microsoft.Azure.Cosmos/src/Routing/AvailabilityStrategy/AvailabilityStrategy.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalPartitionEndpointManager.cs Outdated
@kundadebdatta kundadebdatta self-assigned this Aug 7, 2025
ananth7592 previously approved these changes Aug 8, 2025
Comment thread Microsoft.Azure.Cosmos/src/UserAgentContainer.cs
@FabianMeiswinkel (Member) left a comment

LGTM except for one small question

Comment thread Microsoft.Azure.Cosmos/src/UserAgentContainer.cs Outdated
@FabianMeiswinkel (Member) left a comment

LGTM except for the regex compilation comment

Comment thread Microsoft.Azure.Cosmos/src/UserAgentContainer.cs Outdated
@FabianMeiswinkel (Member) left a comment

LGTM - Thanks!

@kundadebdatta kundadebdatta changed the title PPAF: Adds Dynamic Enablement of PPAF [Internal] PPAF: Adds Dynamic Enablement of PPAF Aug 12, 2025
@NaluTripician NaluTripician added and removed the auto-merge label Aug 13, 2025
@FabianMeiswinkel FabianMeiswinkel merged commit 9fc8b0a into master Aug 13, 2025
28 checks passed
@FabianMeiswinkel FabianMeiswinkel deleted the users/nalutripician/ppafDynamicEnable branch August 13, 2025 13:04
ananth7592 added a commit that referenced this pull request May 1, 2026
…ind PPAF

When ExcludeRegions filters out all preferred read regions and PPAF
(Partition Level Failover) is enabled, GetApplicableEndpoints now falls back
to WriteEndpoints[0] (dynamic, tracks current write region) instead of
this.defaultEndpoint (static, region-agnostic URI set once at init).

The fix is gated behind isPartitionLevelFailoverEnabled (Func<bool>) wired
from ConnectionPolicy.EnablePartitionLevelFailover through GlobalEndpointManager,
supporting dynamic enablement per PR #5310.

When PPAF is disabled, original behavior (defaultEndpoint fallback) is preserved.

Fixes #5821

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ananth7592 added a commit that referenced this pull request May 4, 2026
… enabled and all regions excluded (#5823)

## Problem

When ApplicationPreferredRegions == ExcludeRegions,
LocationCache.GetApplicableEndpoints falls back to this.defaultEndpoint,
a static, region-agnostic URI set once at init and never updated. After
a write region (hub) switch, the GlobalAddressResolver's cached
EndpointCache for this default endpoint has a stale
AddressResolver.location, causing incorrect region tracking in
diagnostics, per-partition routing, and retry logic.

## Fix

When PPAF (IsPartitionLevelFailoverEnabled) is enabled,
GetApplicableEndpoints now uses WriteEndpoints[0] (dynamic, tracks
current write region) as the read fallback instead of
this.defaultEndpoint.

This aligns with:
- UpdateLocationCache (L756-760) which already uses WriteEndpoints[0]
for ReadEndpoints fallback
- Java SDK: writeRegionalRoutingContexts.get(0)
- Python SDK: get_write_regional_routing_contexts()[0]

### PPAF Gating
The fix is gated behind Func<bool> isPartitionLevelFailoverEnabled wired
from ConnectionPolicy.EnablePartitionLevelFailover through
GlobalEndpointManager, supporting dynamic enablement per PR #5310. When
PPAF is disabled, original behavior (defaultEndpoint fallback) is
preserved.
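
A minimal sketch of the gated fallback described above, assuming simplified stand-in types. EndpointSelector and GetApplicableReadEndpoints are hypothetical names for illustration and are not the actual LocationCache API; only the Func<bool> gate and the WriteEndpoints[0] versus defaultEndpoint choice come from the description.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for the read-endpoint selection in LocationCache.
public sealed class EndpointSelector
{
    private readonly Uri defaultEndpoint;
    private readonly Func<bool> isPartitionLevelFailoverEnabled;

    public EndpointSelector(Uri defaultEndpoint, Func<bool> isPartitionLevelFailoverEnabled)
    {
        this.defaultEndpoint = defaultEndpoint;
        this.isPartitionLevelFailoverEnabled = isPartitionLevelFailoverEnabled;
    }

    // Refreshed elsewhere when the account's write region changes.
    public IReadOnlyList<Uri> WriteEndpoints { get; set; } = Array.Empty<Uri>();

    public IReadOnlyList<Uri> GetApplicableReadEndpoints(
        IReadOnlyDictionary<string, Uri> readEndpointByRegion,
        IEnumerable<string> preferredRegions,
        IEnumerable<string> excludeRegions)
    {
        var excluded = new HashSet<string>(excludeRegions, StringComparer.OrdinalIgnoreCase);

        // Keep preferred regions that are not excluded and have a known endpoint.
        List<Uri> applicable = preferredRegions
            .Where(region => !excluded.Contains(region) && readEndpointByRegion.ContainsKey(region))
            .Select(region => readEndpointByRegion[region])
            .ToList();

        if (applicable.Count > 0)
        {
            return applicable;
        }

        // Everything was filtered out. The delegate is evaluated on every call,
        // so a runtime PPAF toggle takes effect without rebuilding the selector.
        bool useWriteEndpoint = this.isPartitionLevelFailoverEnabled() && this.WriteEndpoints.Count > 0;
        return new[] { useWriteEndpoint ? this.WriteEndpoints[0] : this.defaultEndpoint };
    }
}
```

Passing a Func<bool> rather than a bool captured at construction is what lets the fallback follow the dynamic enablement from PR #5310. The guard on an empty WriteEndpoints list is an extra assumption of this sketch.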

## Changes
- **LocationCache.cs**: Added isPartitionLevelFailoverEnabled parameter;
gated read fallback behind it
- **GlobalEndpointManager.cs**: Wires
ConnectionPolicy.EnablePartitionLevelFailover into LocationCache
- **LocationCacheTests.cs**: 3 new tests covering PPAF on/off/dynamic
toggle scenarios

## Testing
All 94 LocationCacheTests pass.

Fixes #5821

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Debdatta Kunda <87335885+kundadebdatta@users.noreply.github.com>

Labels

auto-merge, PerPartitionAutomaticFailover

Development

Successfully merging this pull request may close these issues.

[Per Partition Automatic Failover] - Enable PPAF Dynamically upon change on Account Properties Metadata Response

5 participants